Lecture 8 addendum
==================

This note discusses two very technical points related to lecture 8. The first is a correction to my definition of LL(1); the second is a brief explanation of the term "LL(1)". You are not responsible for the material in this addendum.

Correction to definition of LL(1):
==================================

You will not be responsible for this, but I just wanted to correct the record. There is a slight problem in the definition of LL(1) that I gave in the notes. Consider this grammar (this is, by the way, from the Wikipedia entry for top-down parsing):

    S -> A a b
    A -> a | epsilon

According to our definition, this grammar is LL(1): the right-hand sides of the productions from A have FIRST(a) = { a } and FIRST(epsilon) = { • }, and these are non-overlapping. However, this grammar cannot be correctly parsed by recursive descent. Given that the first token in the input is a, parseA cannot reliably decide which production to use: if the input is aab, then it should use A -> a, but if it is ab, then A -> epsilon is the correct production. So we said this grammar is LL(1), but the parser we constructed will not parse it correctly.

What is the problem? When deciding which production to use in parseA, we failed to take into account that, when the next input token is a, both A -> a and A -> epsilon are valid choices. This is akin to the case where two right-hand sides have overlapping FIRST sets.

There is a way to account for this, using so-called FOLLOW sets. When we see an epsilon production such as A -> epsilon, we should ask this question: is there any string of grammar symbols derivable from S in which A is immediately followed by a? If so, then when we see the a, we can't know whether to consume it (using the A -> a production) or leave it alone (using the A -> epsilon production), so in that case we should report that the grammar is not LL(1).

The set of tokens t that can follow a non-terminal A in some string derived from the start symbol is called the follow set of A, written FOLLOW(A). The algorithm for computing follow sets can be found in any compiler textbook or online (I don't want to take the time to cover it in class). The correct definition of LL(1) takes follow sets into account when deciding how to deal with epsilon productions; the idea is just to add FOLLOW(A) to FIRST(A) if A is nullable. But I didn't want to get into this technicality, and you will not be responsible for it.
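For those who are curious about what the FOLLOW-based check looks like concretely, here is a small sketch, written in Python rather than in the notation we use in class. It computes nullable non-terminals, FIRST sets, and FOLLOW sets for the grammar above using the standard textbook fixpoint algorithms, and then applies the corrected LL(1) test: the lookahead set of a production A -> alpha is FIRST(alpha), plus FOLLOW(A) if alpha can derive epsilon, and the productions of a given non-terminal must have pairwise disjoint lookahead sets. All of the names (GRAMMAR, lookaheads, and so on) are invented for this note.

    # A hedged sketch of the corrected LL(1) check, hard-coded for the grammar
    #     S -> A a b        A -> a | epsilon
    # Epsilon is written as the empty tuple ().
    GRAMMAR = {
        "S": [("A", "a", "b")],
        "A": [("a",), ()],
    }
    START = "S"
    NONTERMS = set(GRAMMAR)
    END = "$"                       # end-of-input marker

    def nullable_set():
        """Non-terminals that can derive the empty string."""
        nullable, changed = set(), True
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                if nt not in nullable and any(all(s in nullable for s in rhs) for rhs in rhss):
                    nullable.add(nt)
                    changed = True
        return nullable

    def first_of(seq, first, nullable):
        """FIRST of a string of grammar symbols."""
        out = set()
        for s in seq:
            out |= first[s] if s in NONTERMS else {s}
            if s not in nullable:
                return out
        return out                  # every symbol in seq was nullable

    def first_sets(nullable):
        first, changed = {nt: set() for nt in NONTERMS}, True
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                for rhs in rhss:
                    new = first_of(rhs, first, nullable) - first[nt]
                    if new:
                        first[nt] |= new
                        changed = True
        return first

    def follow_sets(first, nullable):
        follow, changed = {nt: set() for nt in NONTERMS}, True
        follow[START].add(END)
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                for rhs in rhss:
                    for i, s in enumerate(rhs):
                        if s not in NONTERMS:
                            continue
                        rest = rhs[i + 1:]
                        new = first_of(rest, first, nullable)
                        if all(x in nullable for x in rest):    # rest can vanish
                            new |= follow[nt]
                        if not new <= follow[s]:
                            follow[s] |= new
                            changed = True
        return follow

    def lookaheads(nt, rhs, first, follow, nullable):
        """Tokens on which a predictive parser would choose nt -> rhs."""
        toks = first_of(rhs, first, nullable)
        if all(s in nullable for s in rhs):     # rhs can derive epsilon
            toks |= follow[nt]
        return toks

    nullable = nullable_set()
    first = first_sets(nullable)
    follow = follow_sets(first, nullable)
    for nt, rhss in GRAMMAR.items():
        sets = [lookaheads(nt, rhs, first, follow, nullable) for rhs in rhss]
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:
                    print("not LL(1): productions of", nt, "clash on", sets[i] & sets[j])

On this grammar the sketch reports that the two productions for A clash on the token a, which is exactly the problem described above: FOLLOW(A) = { a }, so the corrected check rejects the grammar as not LL(1), even though the plain FIRST-set check accepts it.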
Why LL(1)?
==========

A student asked what the significance of the name "LL(1)" is. I didn't want to take class time on this, because it's a bit complicated to answer, but I'll answer it here for those who are interested.

First, the easy parts. The first "L" refers to the fact that parsing occurs during a left-to-right scan of the input. The "1" means that parsing decisions are based only on the next input token; if those decisions were based on the next *two* input tokens, we would call the grammar "LL(2)".

Now the hard part: the second "L". It stands for "leftmost derivation." Again, you are not responsible for this, but for those who are curious, here goes.

We need to think about how parse trees are constructed during a top-down parse. Instead of doing it the way I did it in lecture 7, think of it this way: suppose there is a global parse tree that we're constructing, and we add to it whenever we make a parsing decision. That is, at any given time we are looking to expand (that is, add a right-hand side as children of) some non-terminal in the parse tree.

We start with the parse tree consisting of just the start symbol S. As soon as we decide which production from S to use, we add its right-hand side as children of S. We then handle each of its symbols in turn, as described in class. Each time we decide what production to use for a given non-terminal, we add that production's right-hand side to the tree.

For example, consider the grammar

    A -> id | ( B )
    B -> A C
    C -> + A C | epsilon

and the input ( ( x ) ), where x is an id token. We start with the tree consisting of just A and immediately realize that we need to use the second production for A, so we add (, B, and ) as children of A (the trees below are drawn with children indented under their parent):

    A
      (
      B
      )

We match the first (, then call parseB. parseB realizes - it has no choice - that it has to use the production B -> A C, so it adds that to the tree:

    A
      (
      B
        A
        C
      )

and then calls parseA. parseA again decides on the production A -> ( B ) and adds this to the tree:

    A
      (
      B
        A
          (
          B
          )
        C
      )

It again matches the ( and calls parseB, which again expands the tree:

    A
      (
      B
        A
          (
          B
            A
            C
          )
        C
      )

and calls parseA. This time parseA matches the id and adds that:

    A
      (
      B
        A
          (
          B
            A
              id
            C
          )
        C
      )

parseA returns to parseB, which calls parseC. parseC doesn't see a +, so it uses the epsilon production:

    A
      (
      B
        A
          (
          B
            A
              id
            C
              epsilon
          )
        C
      )

We continue, eventually calling parseC again, which fills in the final part of the parse tree:

    A
      (
      B
        A
          (
          B
            A
              id
            C
              epsilon
          )
        C
          epsilon
      )

We're done. Now let's write down the *frontiers* of these parse trees at each step (with the trees drawn out like this, you can read off each frontier by reading the leaves left to right, treating epsilon as the empty string):

    A => ( B ) => ( A C ) => ( ( B ) C ) => ( ( A C ) C )
      => ( ( id C ) C ) => ( ( id ) C ) => ( ( id ) )

This is a derivation sequence, as defined in class today. What is notable about this derivation sequence is that, at each step, it is the *leftmost* non-terminal that is replaced by the right-hand side of a production. It is characteristic of top-down parsing that the order in which it decides what productions to use mimics a derivation with this property - a so-called leftmost derivation. The second "L" in "LL(1)" stands for "leftmost derivation."
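For those who want to see this concretely, here is a small sketch of a recursive-descent parser for this grammar, written in Python rather than in the notation we use in class. It maintains the global parse tree described above and prints the tree's frontier after every expansion; on the token list for ( ( x ) ) it prints exactly the derivation sequence shown above (minus the initial frontier A). The names (Node, Parser, expand, and so on) are invented for this note.

    # A hedged sketch: recursive descent that records each parsing decision in a
    # global parse tree and prints the frontier after each expansion.
    class Node:
        def __init__(self, label):
            self.label = label
            self.children = []

    def frontier(node):
        """Leaves of the (partial) parse tree, left to right; epsilon leaves vanish."""
        if not node.children:
            return [] if node.label == "epsilon" else [node.label]
        return [leaf for child in node.children for leaf in frontier(child)]

    class Parser:
        def __init__(self, tokens, root):
            self.tokens = tokens    # e.g. ["(", "(", "id", ")", ")"]
            self.pos = 0
            self.root = root        # the global parse tree we keep extending

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def match(self, tok):
            assert self.peek() == tok, "expected " + tok
            self.pos += 1

        def expand(self, node, symbols):
            """Record a parsing decision: add one child per right-hand-side symbol."""
            node.children = [Node(s) for s in symbols]
            print(" ".join(frontier(self.root)))    # one step of the derivation
            return node.children

        def parseA(self, node):
            if self.peek() == "id":                 # A -> id
                self.expand(node, ["id"])
                self.match("id")
            else:                                   # A -> ( B )
                _, b, _ = self.expand(node, ["(", "B", ")"])
                self.match("(")
                self.parseB(b)
                self.match(")")

        def parseB(self, node):
            a, c = self.expand(node, ["A", "C"])    # B -> A C (the only choice)
            self.parseA(a)
            self.parseC(c)

        def parseC(self, node):
            if self.peek() == "+":                  # C -> + A C
                _, a, c = self.expand(node, ["+", "A", "C"])
                self.match("+")
                self.parseA(a)
                self.parseC(c)
            else:                                   # C -> epsilon
                self.expand(node, ["epsilon"])

    root = Node("A")
    Parser(["(", "(", "id", ")", ")"], root).parseA(root)

The thing to notice is that expand is always applied to the leftmost not-yet-expanded non-terminal in the tree, which is why the printed frontiers, read top to bottom, form a leftmost derivation.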